Peer Storage (Part 3): Identifying Lost Channel States #3897
base: main
Conversation
👋 Thanks for assigning @tnull as a reviewer!
Force-pushed from 101f31c to a35566a
`PeerStorageMonitorHolder` wraps a single `ChannelMonitor`; we add some fields separately so that we do not need to read the whole `ChannelMonitor` to identify whether we have lost some states. `PeerStorageMonitorHolderList` keeps the list of all the channels which would be sent over the wire inside peer storage.
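As a rough illustration, the wrappers might look something like this (field names and types are assumptions for the sketch, not the PR's exact definitions; `min_seen_secret` and `counterparty_node_id` are mentioned elsewhere in the diff):

```rust
// Illustrative sketch only: field names/types are assumptions, not the PR's
// exact definitions.
struct PeerStorageMonitorHolder {
    channel_id: [u8; 32],           // which channel this monitor belongs to
    min_seen_secret: u64,           // lets us detect lost state without a full read
    counterparty_node_id: [u8; 33], // who the channel is with
    monitor_bytes: Vec<u8>,         // the serialised ChannelMonitor itself
}

struct PeerStorageMonitorHolderList {
    monitors: Vec<PeerStorageMonitorHolder>,
}

fn main() {
    let list = PeerStorageMonitorHolderList { monitors: Vec::new() };
    assert!(list.monitors.is_empty());
}
```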
Create a utility function to prevent code duplication when writing `ChannelMonitor`s. Serialise them inside `ChainMonitor::send_peer_storage` and send them over. TODO: peer storage must not cross the 64k limit.
Deserialise the `ChannelMonitor`s and compare the data to determine whether we have lost some states.
The node should now determine lost states using the retrieved peer storage.
Force-pushed from a35566a to 4c9f3c3
Codecov Report
Attention: Patch coverage is …

@@ Coverage Diff @@
##            main    #3897      +/-   ##
==========================================
- Coverage   88.86%   88.86%   -0.01%
==========================================
  Files         165      165
  Lines      118886   118962      +76
==========================================
+ Hits       105650   105710      +60
- Misses      10911    10923      +12
- Partials     2325     2329       +4
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
@wpaulino what do you think we should reasonably cut here to reduce the size of a `ChannelMonitor`, without making the emergency-case `ChannelMonitor`s all that different from the regular ones and inducing more code changes across channelmonitor.rs? Obviously we should avoid `counterparty_claimable_outpoints`, but how much code is gonna break in doing so?
Not too familiar with the goals here, but if the idea is for the emergency-case `ChannelMonitor` to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even `counterparty_claimable_outpoints` (without the sources, though), since the counterparty could broadcast a revoked commitment.
Basically. I think ideally we find a way to store everything (required) but `counterparty_claimable_outpoints`, so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not the HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without `counterparty_claimable_outpoints`, but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up, but that's okay.
	}
},
None => {
	// TODO: Figure out if this channel is so old that we have forgotten about it.
There's no need to worry here, I think. If the channel is gone, we either haven't fallen behind (probably) or we already broadcasted a stale state (because we broadcast on startup if the channel is gone and we have a `ChannelMonitor`), at which point we're screwed. So nothing to do here.
Thanks for clarifying, I will remove this.
🔔 1st Reminder: Hey @tnull! This PR has been waiting for your review.
Took a first look, but will hold off on going into more detail until we've decided which way we should go with the `ChannelMonitor` stub.
},
Err(e) => {
	panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but really no reason to refuse to at least keep other potential channels operational.
Yes, that makes sense, I think we should only panic if we have determined that we have lost some channel state.
A few more comments; let's move forward without blocking on the `ChannelMonitor` serialization stuff.
lightning/src/ln/our_peer_storage.rs
Outdated
impl Writeable for PeerStorageMonitorHolderList {
	fn write<W: Writer>(&self, w: &mut W) -> Result<(), io::Error> {
		encode_tlv_stream!(w, { (1, &self.monitors, required_vec) });
		Ok(())
	}
}

impl Readable for PeerStorageMonitorHolderList {
	fn read<R: io::Read>(r: &mut R) -> Result<Self, DecodeError> {
		let mut monitors: Option<Vec<PeerStorageMonitorHolder>> = None;
		decode_tlv_stream!(r, { (1, monitors, optional_vec) });

		Ok(PeerStorageMonitorHolderList { monitors: monitors.ok_or(DecodeError::InvalidValue)? })
	}
}
You should be able to replace both of these with a single `impl_writeable_tlv_based` macro call.
Fixed, thanks for this!
lightning/src/chain/chainmonitor.rs
Outdated
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing `monitors.iter().skip(random_usize % monitors.len()).next()`).
lightning/src/ln/channelmanager.rs
Outdated
@@ -8807,6 +8808,7 @@
	&self, peer_node_id: PublicKey, msg: msgs::PeerStorageRetrieval,
) -> Result<(), MsgHandleErrInternal> {
	// TODO: Check if we have any stale or missing ChannelMonitor.
	let per_peer_state = self.per_peer_state.read().unwrap();
No need to take the (read) lock at the top, do it after we decrypt.
Fixed!
Did a ~first pass.
This needs a rebase now, in particular now that #3922 landed.
@@ -810,10 +813,53 @@
	}

	fn send_peer_storage(&self, their_node_id: PublicKey) {
		// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

		static MAX_PEER_STORAGE_SIZE: usize = 65000;
This should be a `const` rather than a `static`, I think? Also, it would probably make sense to add this at the module level, with some docs.
Also isn't the max size 64 KiB, not 65K?
Oh, my bad, thanks for pointing this out. It should be 65531.
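For context, one plausible derivation of the 65531 figure (an assumption, not stated in the thread): a Lightning message payload is capped at 65535 bytes, and a peer storage message spends 2 bytes on the message type plus 2 bytes on the blob's length prefix:

```rust
// Hypothetical derivation of the 65531-byte limit; the breakdown into
// type/length overhead is an assumption based on the BOLT wire format.
const MAX_LN_MESSAGE_PAYLOAD: usize = 65535; // max Lightning message size
const MSG_TYPE_LEN: usize = 2; // 2-byte message type
const BLOB_LEN_PREFIX: usize = 2; // 2-byte length prefix on the storage blob
const MAX_PEER_STORAGE_SIZE: usize =
    MAX_LN_MESSAGE_PAYLOAD - MSG_TYPE_LEN - BLOB_LEN_PREFIX;

fn main() {
    assert_eq!(MAX_PEER_STORAGE_SIZE, 65531);
}
```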
let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..8].try_into().unwrap());
Depending on the platform, a `usize` might not always be 8 bytes. You'll probably need to do `const USIZE_LEN: usize = core::mem::size_of::<usize>();` and use that instead of `8`.
Fixed, thanks for pointing it out.
/// NOTE: `is_stub` is true only when we are using this to serialise for Peer Storage.
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
Please don't add `rustfmt::skip` when introducing new code. It might be good to introduce a commit at the beginning that removes the `skip` from `fn write`, so that the code changes here are in fact mostly code moves.
let mut curr_size = 0;

// Randomising keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Can we make this a bit cleaner by using the proposed iterator-`skip`ping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?
let min_seen_secret = mon.monitor.get_min_seen_secret();
let counterparty_node_id = mon.monitor.get_counterparty_node_id();

match write_util(&mon.monitor.inner.lock().unwrap(), true, &mut ser_chan) {
nit: Please move taking the lock out into a dedicated variable. This would also make it easier to spot the scoping of the lock, IMO.
Fixed!
match write_util(&mon.monitor.inner.lock().unwrap(), true, &mut ser_chan) {
	Ok(_) => {
		let mut ser_channel = Vec::new();
I think instead of creating a new `Vec` and then writing to it, you should be able to just call `encode` on the `PeerStorageMonitorHolder`.

But I'm currently confused about what we use `ser_channel` for to begin with: is it just to calculate the length below? That seems like a big unnecessary allocation. You could use `serialized_length`, for example, and keep track of the written bytes, comparing them to `MAX_PEER_STORAGE_SIZE`.
}

let mut serialised_channels = Vec::new();
monitors_list.write(&mut serialised_channels).unwrap();
Same here, just use `encode`.
monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
	panic!("Can not write monitor for {}", mon.monitor.channel_id())
Really, please avoid these explicit panics in any of this code.
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
nit: We often use an `internal_` prefix or `_internal` suffix when splitting out functionality into util methods. This is no strict rule, but given the precedent it might be an easier-to-understand name.
let per_peer_state = self.per_peer_state.read().unwrap();

let mut cursor = io::Cursor::new(decrypted);
match <PeerStorageMonitorHolderList as Readable>::read(&mut cursor) {
nit: It might be cleaner to split the read call onto its own line using `Readable::read`. Given that we don't want to panic, we can probably also avoid the huge `match` by using `map_err` or `unwrap_or_else`.
In this PR, we begin serialising the `ChannelMonitor`s and sending them over the wire so that, upon retrieval, we can determine whether any channel states were lost.
The next PR will be the final one, where we use `FundRecoverer` to initiate a force close and potentially go on-chain with a penalty transaction.
Sorry for the delay!